Phylogenetic diversity statistics for all clades in a phylogeny

Abstract The classic quantitative measure of phylogenetic diversity (PD) has been used to address problems in conservation biology, microbial ecology, and evolutionary biology. PD is the minimum total length of the branches in a phylogeny required to cover a specified set of taxa on the phylogeny. A general goal in the application of PD has been identifying a set of taxa of size k that maximize PD on a given phylogeny; this has been mirrored in active research to develop efficient algorithms for the problem. Other descriptive statistics, such as the minimum PD, average PD, and standard deviation of PD, can provide invaluable insight into the distribution of PD across a phylogeny (relative to a fixed value of k). However, there has been limited or no research on computing these statistics, especially when required for each clade in a phylogeny, enabling direct comparisons of PD between clades. We introduce efficient algorithms for computing PD and the associated descriptive statistics for a given phylogeny and each of its clades. In simulation studies, we demonstrate the ability of our algorithms to analyze large-scale phylogenies with applications in ecology and evolutionary biology. The software is available at https://github.com/flu-crew/PD_stats.


Introduction
Habitat loss, biological invasions, emerging infectious diseases, and climate change are altering global ecological systems. Extinctions have vastly outpaced speciation events over the past century (Barnosky et al. 2011, Ceballos et al. 2015, and novel pathogens are emerging at an increasing rate with two pandemics within the past 20 years (Gibb et al. 2020, Keusch et al. 2022. Quantifying biodiversity is consequential as it permits assessment of when and where significant biological changes are occurring. Traditional approaches to measuring changes in diversity patterns have taken the form of generating phylogenetic trees from genetic sequence data and then extracting measures of phylogenetic diversity (PD) from the inferred trees (Cadotte et al. 2010).
A now classic measure to quantify diversity is the PD index introduced by Faith (1992). PD measures the phylogenetic history among taxa occurring in a given sample and is calculated on a rooted phylogenetic tree for a set of taxa in the tree (i.e. represented as leaves). It is defined as the sum of the edge lengths that span these taxa along with the root of the tree (Fig. 1). This index directly interprets the evolutionary history of the taxon set and has been applied to generate solutions to practical problems in ecology and evolutionary biology (Li et al. 2020). Consequently, there has been extensive computational work considering the mathematical and combinatorial properties of the index (Steel 2005, Hartmann and Steel 2006, Minh et al. 2006, Moulton et al. 2007, Bordewich and Semple 2012. Specifically, there are polynomial time algorithms to solve the maximum PD problem on phylogenetic trees (Halldó rsson et al. 1999, Steel, 2005, Minh et al. 2006, Hartmann and Steel 2006. However, PD applications (e.g. Faith and Richards 2012, Faith 2018, Faith and Simon 2018 could benefit from understanding descriptive statistics associated with PD across the entire phylogeny in addition to the maximum PD and the maximal PD set. These statistics allow for the identification of critical nodes within the phylogeny and for comparison of diversity statistics between nodes.
There has been progress toward calculating the minimum, average, and variance of PD on a phylogeny with n leaves and a subset of those leaves of size k. Hartmann et al. proposed using the minimum PD value to compute the gain in PD for the given set of taxa when compared with the worst-case scenario (i.e. a set of taxa with the smallest PD) (Hartmann and André 2013). Similarly, Manson et al. suggested using minimum PD to find the worst-case conservation scenario -i.e. find k taxa whose extinction will decrease the overall PD the most (Manson et al. 2022). Both Hartmann and André (2013) and Manson et al. (2022) provided algorithms for finding a value and a set for minimum PD. However, the complexity of the algorithm was not analyzed by Hartmann and André (2013), and the algorithm described by Manson et al. (2022) is prohibitive for larger phylogenies (and larger k) due to its Oðnk 3 Þ run time. The average PD was also explored by Hartmann et al. to compute how "beneficial" a taxon set is when compared with a randomly chosen set of k taxa (Hartmann and André 2013), i.e. to identify sets of taxa with higher than average PD. Tsirogiannis et al. (2012Tsirogiannis et al. ( , 2014 proposed algorithms to compute both average PD and variance PD. Though these approaches computed the PD statistics approximately in linear time, they do not have an approximation guarantee as they approximate ratios of binomial coefficients using the hypergeometric distribution. In this work, we introduce dynamic programming algorithms designed to efficiently and accurately compute the distribution of PD on a phylogeny by calculating the associated descriptive statistics for each clade on a given tree. Our solution for the Minimum PD problem has OðknÞ time-complexity, significantly improving on the previous algorithm by Manson et al. (2022) that runs in Oðnk 3 Þ time, where n is the number of taxa of a phylogeny and k is the user-specified size of a PD set (query size). We achieved this complexity bound by applying the technique from Halldó rsson et al. to our dynamic programming formulation (Tamir 1996, Halldó rsson et al. 1999. We extended our dynamic programming framework to the Average PD and Variance PD problems, obtaining precise algorithms that run in Oðk 2 n log 2 nÞ time. We designed exact algorithms using big-number arithmetics with binomial coefficients. Crucially, our algorithms compute PD statistics (minimum, maximum, average, and variance) not only for the overall tree but also for every clade in the tree. Moreover, given a fixed input value k, the algorithms compute PD statistics for all 2 k i k. Using a simulation study, we demonstrated that our dynamic programming framework can process large-scale trees with 10,000 taxa in under an hour on a standard PC, computing all of the PD statistics for each clade and every taxon size k, where 1 < k < 10;000.

Basics and preliminaries
We review basic definitions, statistical concepts, and terminology from computational phylogenetics that are needed in the context of this work. Once the PD is defined on a phylogenetic tree, we introduce the statistical measures of PD and finally formalize our problem of computing these statistics on all clades for a given phylogeny.

Basic definitions
A (phylogenetic) tree T :¼ ðVðTÞ; EðTÞ; wÞ, over a taxon set X, is a rooted full binary tree where each leaf is uniquely labeled with a taxon from X and edges are weighed by w: EðTÞ ! R þ . For convenience, we identify the taxa set with the leaf set, and q denotes the root of T.
A subtree induced by S X is the minimal subtree of T connecting the vertices in S [ fqg; this subtree is denoted by T S . Further, the subtree rooted at v 2 VðTÞ is denoted by T v , and the number of leaves in T v is denoted by jvj. v k is a collection of all subsets of X of size k, in the tree rooted at v, i.e. v k :¼ fS j S X \ VðT v Þ; jSj ¼ kg: Let T be a phylogenetic tree, for a subset S X, the PD of S, also referred to as PD, is For convenience, we write PD(S) when T is clear from the context.
We use wðv # Þ to denote the weight of the edge incident to v from the path between q and v. In addition, we define w(T) as the sum of the weights of all edges in T.

PD statistics and problems
We define the PD statistics and then introduce the corresponding computational problems.
To define the PD statistics let T be a phylogenetic tree over a taxon set of size n and k be an integer, where 1 k n.
• The minimum PD (MinPD) is defined as The PD statistics can be computed at every clade of T. For instance, in Fig. 1, if we restrict the tree to the subtree at node ii, then one can compute that d ð2Þ ¼ 6 (S ¼ fC; Dg), D ð2Þ ¼ 9 (S ¼ fC; Eg), a ð2Þ ¼ 7:33; and w ð2Þ ¼ 1:55.
Finally, we introduce the PD statistics problems that are based on the above-mentioned definitions.

PD problems
Given a phylogenetic tree T over a taxon set of size n and an integer k, where 1 k n, compute • MinPD Problem: d ðkÞ , and a Minimal PD set S min • MaxPD Problem: D ðkÞ , and a Maximal PD set S max Figure 1. An example of a phylogenetic tree. The internal nodes below the root are labeled by roman numerals. The phylogenetic diversity of a set of taxa fA, B, Dg is 3 þ 4 þ 1 þ 1 þ 2 þ 1 ¼ 12 as can be seen from the highlighted (bold) edges.
• AvgPD Problem: a ðkÞ • VarPD Problem: w ðkÞ The above statistics provide information about the distribution of PD on a fixed tree. We further extend the above problems to compute the aggregate statistics, not only for T overall but also for each individual clade in T. Note that for a clade at a node v, we only compute the above statistics for k jvj (for k > jvj, the corresponding values are undefined).

Solving PD Problems
We introduce efficient dynamic programing algorithms to solve the PD problems for all clades in a given tree with n leaves and an integer k. First, we prove a OðknÞ recurrence for the minimum and maximum PD problems. Then, for the first time, we describe efficient algorithms for exactly solving the average PD and variance PD problems in Oðk 2 n log 2 nÞ time.

Minimum PD problem
We define dðv; kÞ-subsolutions at subtrees and describe an OðknÞ time algorithm to compute d ðkÞ for each clade in T with taxon set X.
In a tree T, the subproblem at v 2 VðTÞ, for the MinPD problem, is defined as the problem to compute d ðkÞ for T v . Thus, for 1 k jvj in a tree rooted at a node v, we define dðv; kÞ to be the min PD Tv ðSÞ. That is, dðv; kÞ :¼ min PROPOSITION 1 For every node v 2 VðTÞ with children x and y (if they exist), and 1 k jvj: dðv; where wðx # Þ is the weight of the edge from v to x, and wðy # Þ is the weight of the edge from v to y.

PROOF.
For v being a leaf and any k (the base case), dðv; kÞ ¼ 0; since there is no diversity to be selected from the subtree rooted at a leaf. For an internal node v with children x and y, there are multiple possibilities for splitting k taxa between the subtrees rooted at x and y.
• Case I: all the k taxa in a minimal PD set S are from the subtree rooted at x. Note that for any S 2 x k , we have PD Tv ðSÞ ¼ PD Tx ðSÞ þ wðx # Þ. Therefore, set S that minimizes PD Tx ðSÞ also minimizes PD Tv ðSÞ. That is, dðv; kÞ ¼ dðx; kÞ þ wðx # Þ: • Case II: all the k taxa in a minimal PD set S are from the subtree rooted at y. Similarly to Case I, in this case dðv; kÞ ¼ dðy; kÞ þ wðy # Þ: • Case III: k taxa are split between the subtrees rooted at x and y. Let us fix some k 1 and k 2 , such that 1 k 1 jxj; 1 k 2 jyj; k 1 þ k 2 ¼ k. Then for S 1 2 x k 1 and S 2 2 y k 2 , we have PD Tv ðS 1 [ S 2 Þ ¼ PD Tx ðS 1 Þ þ PD Ty ðS 2 Þ þ wðx # Þ þ wðy # Þ: Then minimizing over all such sets S 1 and S 2 reduces to independent minimization in the left and right subtrees. It is then not difficult to see that min S2v k k2 jyj ðdðx; k 1 Þ þ dðy; k 2 ÞÞ: Combining all three above cases, we obtain the formula from Proposition 1. h Our pseudocode for the algorithm using the recurrence above is presented in Algorithm 1. for v 2 X do 4: for i ¼ 0 to k do 5: d½v ; i ¼ 0 6: end for 7: end for 8: for v 2 V nX (in post order) do 9: //x, y are children of v 10: for i ¼ 1 to minðk; jv jÞ do 11: minval ¼ 1 12: for r ¼ maxð0; i À jyjÞ to minðjxj; iÞ do 13: l i À r 14: val programming of this type in fact runs in OðknÞ time.
We explicitly summarize their results in Lemma 3.
LEMMA 3 (Tamir 1996, Halldó rsson et al. 1999) For a non-leaf node v with children x and y, let I v denote the number of pairs (l, r) such that 0 l jxj; 0 r jyj, and l þ r minðk; jvjÞ.

Maximum PD problem
The original MaxPD problem can be solved in O(n) time (Spillner et al. 2008). Then, solving this problem for every clade in a tree would require Oðn 2 Þ time. However, adapting the dynamic programming approach from the previous section, we can solve the same problem in OðknÞ time, which is more desirable, especially for smaller k. The adaptation would only require changing the minimization criterion in Algorithm 1 to maximization. Additionally, it might be possible to extend the algorithm from Spillner et al. (2008) to achieve a similar runtime.

Average PD problem
We begin by defining AvgPD subproblems and then prove the recurrence relation for computing the bðv; kÞ-subsolution at T v (defined below). We then present an Oðk 2 n log 2 nÞ time and Oðk 2 n log nÞ space algorithm to compute a ðkÞ for each clade in T. We define b ðkÞ :¼ P S2q k PDðSÞ. That is, b ðkÞ ¼ a ðkÞ Á jXj k . We then define the subproblem bðv; kÞ at a vertex v 2 VðTÞ, for the AvgPD problem, as the problem to compute b ðkÞ in T v subtree. That is, for 1 k jvj, bðv; kÞ :¼ X bðx; kÞ þ jxj k wðx # Þ þ ½bðy; kÞ þ jyj k wðy # Þ # PROOF. For v 2 X; bðv; 1Þ ¼ PD Tv ðfvgÞ ¼ 0. We now prove the recurrence for internal v with children x and y. We split bðv; kÞ ¼ P S2v k PD Tv ðSÞ into three categories.
• Case I : all the k taxa in S are chosen from the subtree rooted at x, i.e. S 2 x k . Then, b I ðv; kÞ ¼ X S2x k ðPDðSÞ þ wðx # ÞÞ ¼ bðx; kÞ þ jxj k wðx # Þ: • Case II : all the k taxa in S are chosen from the subtree rooted at y. Similarly to Case I, we have b II ðv; kÞ ¼ bðy; kÞ þ jyj k wðy # Þ: • Case III : k taxa are split between subtrees rooted at x and y. Let k 1 and k 2 be such that 1 k 1 jxj; 1 k 2 jyj; k 1 þ k 2 ¼ k. Then we define bðv; k 1 ; k 2 Þ :¼ X It is then not difficult to see that Next, observe the following bðv; k 1 ; k 2 Þ ¼ X S1 2 x k1 S2 2 y k2 where aðv; kÞ is the solution for the the problem to compute a ðkÞ in T v . Our pseudo code for the algorithm using the recurrence above is presented above. Grover et al.
PROOF. The analysis of the complexity of Algorithm 2 is similar to the analysis of Algorithm

Variance PD problem
Recall that by definition, Using Algorithm 2, we can compute a ðkÞ in Oðk 2 n log nÞ time. Therefore, we are only left to compute the sum of squares P S2q k PDðSÞ 2 , which we denote by c ðkÞ . For convenience, we also define cðv; kÞ :¼ X S2v k PD Tv ðSÞ 2 ; for any v 2 VðTÞ and 1 k jvj.
PROOF. For a leaf v 2 X; cðv; 1Þ ¼ PD Tv ðfvgÞ 2 ¼ 0. We now prove the recurrence for an internal vertex v with children x and y. We split cðv; kÞ into three categories.
• Case I: all the k taxa in S are chosen from the subtree rooted at x, i.e. S 2 x k . Then, Algorithm 2 AvgPD Algorithm 1: procedure COMPUTEAVG(T, k) 2: Precompute i j values for i n; j k via a Pascal triangle.
PROOF. The proof is similar to the proof of Theorem 5. h

Experimental evaluation
We demonstrate the applicability of the PD statistics algorithms by evaluating their scalability. Additionally, we compare our exact algorithms for calculating AvgPD and VarPD to calculating these measures through randomly sampling taxon sets. The algorithms were implemented in Python 3 and executed on a PC with an Intel Core i7 3.19 Ghz processor using 8.0 GB of RAM under the Windows 10 operating system. Dendropy (Sukamaran and Holder 2010) and Biopython (Cock et al. 2009) were used in our implementation.

Scalability analysis
To demonstrate the applicability of our algorithms, we generated large-scale phylogenetic trees using the birth/death model (1.0 birth rate and 0.5 death rate) with the number n of taxa ranging from 1000 to 10,000 and a step of 1000. For each fixed number of taxa, we generated 10 trees. The PD set size k was set to match the total number of taxa n to measure the worst-case complexity of our algorithms. Thus, we compute the PD statistics for all clades in a tree and all k ranging from 2 to n À 1. Figure 2 demonstrates that for a 10,000 taxon tree, all of the PD statistics on all clades (and all k) can be computed within an hour.

AvgPD and VarPD algorithms are significantly faster than random sampling
When exact algorithms for computing the mean/variance of a distribution are not known, random sampling is typically used to approximate those moments. Here we compare the runtime of our methods for AvgPD and VarPD problems against a random sampling strategy.
We simulated a birth/death tree (1.0 birth rate and 0.5 death rate) with n ¼ 500 taxa and tested different k between 50 and 450 with a step of 50. After computing the exact AvgPD and VarPD values using our algorithms, we sampled subsets of taxa of size k uniformly at random and computed the PD of the chosen samples. After every 200 samples, we computed the overall mean/variance PD of the samples until that value converged to the truth within 0.1%. Below, we report the average runtimes of the algorithms over various k.
For the AvgPD problem, random sampling required 14.2 s to converge, while our algorithm required 2 s to compute, on average. The difference was more significant for the VarPD problem, where random sampling required 28 min to converge (on average), while our algorithm required only 2.8 s. Further, for AvgPD, the gap between our algorithm and random sampling increases with larger n. For example, for n ¼ 2000 and k ¼ 20, random sampling required >1 h to converge, while our algorithm required only 6.5 s to complete.

Conclusion
In this work, we devised effective, efficient algorithms for computing the PD statistics. Notably, our algorithms are not restricted to calculating these values for the entire tree; but for every subtree/clade present in T. Furthermore, our algorithms provide descriptive statistics of PD, including the minimum PD, maximum PD, average PD, and variance PD for all values ranging from 1 up to k. A major merit of our algorithms for the minimum and maximum PD is that these values can be computed simultaneously by building on a similar dynamic programming approach. All the algorithms described in this work are efficient and scale well in practice. Consequently, these algorithms can be applied in the field without limitation based on the number of taxa or k.
Quantifying statistics such as the average PD and variance PD across all nodes in a phylogeny can provide deep insights into evolutionary histories. The average PD is the mean PD value over all sets of k taxa on a tree. In other words, it is the expected PD value if taxa were chosen uniformly at random. Similarly, the variance PD is the variance of the PD values given an equiprobable distribution of selecting any k taxa. Standardizing these diversity measures is crucial, making values comparable for different values of k and across different phylogenies. Further, when computed for all taxa in a phylogeny, these statistics can be used to assess the role of evolutionary history in shaping ecological communities, to determine how this evolutionary history influences niche and resource use, and can be used to study trait evolution and biogeography (Webb et al. 2002).